ReBNN: Resilient Binary Neural Network
in which $\gamma_i^n$ is a balanced parameter. Based on the objective, the weight gradient in Eq. (3.141) becomes:
\[
\begin{aligned}
\delta_{w_i^n} &= \frac{\partial L}{\partial w_i^n} + \gamma_i^n \left( w_i^n - \alpha_i^n b_{w_i^n} \right) \\
&= \alpha_i^n \left( \frac{\partial L}{\partial \hat{w}_i^n} \circledast \mathbf{1}_{|w_i^n| \le 1} - \gamma_i^n b_{w_i^n} \right) + \gamma_i^n w_i^n .
\end{aligned}
\tag{3.144}
\]
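The gradient in Eq. (3.144) can be sketched numerically. The following is a minimal illustration, not the authors' implementation; the function and variable names (`resilient_grad`, `grad_what`, etc.) are assumptions chosen for readability:

```python
import numpy as np

# Sketch of the resilient gradient delta_w in Eq. (3.144):
#   delta_w = alpha * (dL/d(w_hat) * 1_{|w|<=1} - gamma * b_w) + gamma * w
# where b_w = sign(w) is the binarized weight and 1_{|w|<=1} is the
# straight-through-estimator mask.
def resilient_grad(w, alpha, gamma, grad_what):
    b_w = np.sign(w)                  # binarized weight b_w
    ste_mask = np.abs(w) <= 1.0       # straight-through estimator 1_{|w|<=1}
    return alpha * (grad_what * ste_mask - gamma * b_w) + gamma * w

w = np.array([-0.4, 0.2, 1.5])
g = resilient_grad(w, alpha=0.5, gamma=0.1, grad_what=np.array([0.3, -0.2, 0.1]))
```

Note that for the third weight (|w| > 1) the upstream gradient is masked out by the STE, yet the additional term $\gamma w$ still contributes, which is exactly what keeps the update from vanishing when $\alpha$ is small.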
The term $S_i^n(\alpha_i^n, w_i^n) = \gamma_i^n (w_i^n - \alpha_i^n b_{w_i^n})$ is an additional term added in the backpropagation process. We add this term because a too-small $\alpha_i^n$ diminishes the gradient $\delta_{w_i^n}$ and leaves the weight $w_i^n$ constant. In what follows, we state and prove the proposition that $\delta_{w_{i,j}^n}$ is a resilient gradient for a single weight $w_{i,j}^n$. For ease of presentation, we sometimes omit the subscript $i,j$ and the superscript $n$.
Proposition 1. The additional term $S(\alpha, w) = \gamma (w - \alpha b_w)$ achieves a resilient training process by suppressing frequent weight oscillation. Its balanced factor $\gamma$ can be considered the parameter that controls the occurrence of the weight oscillation.
Proof: We prove the proposition by contradiction. For a single weight $w$ centering around zero, the straight-through estimator $\mathbf{1}_{|w| \le 1} = 1$; thus, we omit it in the following. Based on Eq. (3.144), with a learning rate $\eta$, the weight updating process is formulated as:
\[
\begin{aligned}
w^{t+1} &= w^t - \eta \, \delta_{w^t} \\
&= w^t - \eta \left[ \alpha^t \left( \frac{\partial L}{\partial \hat{w}^t} - \gamma b_{w^t} \right) + \gamma w^t \right] \\
&= (1 - \eta\gamma)\, w^t - \eta \alpha^t \left( \frac{\partial L}{\partial \hat{w}^t} - \gamma b_{w^t} \right) \\
&= (1 - \eta\gamma) \left[ w^t - \frac{\eta \alpha^t}{1 - \eta\gamma} \left( \frac{\partial L}{\partial \hat{w}^t} - \gamma b_{w^t} \right) \right],
\end{aligned}
\tag{3.145}
\]
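The update in Eq. (3.145) decays the latent weight by $(1 - \eta\gamma)$ and then applies the rescaled gradient step. A minimal numerical sketch, assuming the STE mask is active ($|w| \le 1$) as in the proof; all names and constants are illustrative:

```python
import numpy as np

# Sketch of one update step from Eq. (3.145):
#   w^{t+1} = (1 - eta*gamma) * w^t - eta * alpha^t * (dL/d(w_hat) - gamma * b_w)
def update_weight(w, alpha, gamma, eta, grad_what):
    b_w = np.sign(w)  # binarized weight b_w at iteration t
    return (1.0 - eta * gamma) * w - eta * alpha * (grad_what - gamma * b_w)

w = -0.3  # latent weight with b_w = -1
w_next = update_weight(w, alpha=0.5, gamma=0.1, eta=0.01, grad_what=0.2)
```

With $b_{w^t} = -1$, the term $-\gamma b_{w^t} = +\gamma$ pushes the update toward keeping the sign unless the upstream gradient is strong enough to overcome it, which is the mechanism the following probability bounds formalize.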
where $t$ denotes the $t$-th training iteration. Different weights lie at different distances from the quantization levels $\pm 1$; therefore, their gradients should be modified according to their scaling factors and the current learning rate. We first assume the initial state $b_{w^t} = -1$; the analysis applies equally to the case $b_{w^t} = 1$. The oscillation probability from iteration $t$ to $t+1$ is the following:
\[
P\left( b_{w^t} \neq b_{w^{t+1}} \right) \Big|_{b_{w^t} = -1} \le P\left( \frac{\partial L}{\partial \hat{w}^t} \le -\gamma \right).
\tag{3.146}
\]
Similarly, the oscillation probability from iteration $t+1$ to $t+2$ is as follows:
\[
P\left( b_{w^{t+1}} \neq b_{w^{t+2}} \right) \Big|_{b_{w^{t+1}} = 1} \le P\left( \frac{\partial L}{\partial \hat{w}^{t+1}} \ge \gamma \right).
\tag{3.147}
\]
Thus, the sequential oscillation probability from iteration $t$ to $t+2$ is as follows:
\[
P\left( \left( b_{w^t} \neq b_{w^{t+1}} \right) \cap \left( b_{w^{t+1}} \neq b_{w^{t+2}} \right) \right) \Big|_{b_{w^t} = -1} \le P\left( \left( \frac{\partial L}{\partial \hat{w}^t} \le -\gamma \right) \cap \left( \frac{\partial L}{\partial \hat{w}^{t+1}} \ge \gamma \right) \right),
\tag{3.148}
\]
which shows that a sequential weight oscillation occurs only if the magnitudes of both $\frac{\partial L}{\partial \hat{w}^t}$ and $\frac{\partial L}{\partial \hat{w}^{t+1}}$ exceed $\gamma$. As a result, the attached factor $\gamma$ can be considered a parameter that controls the occurrence of the weight oscillation.
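The suppression effect can be observed empirically. The following small simulation is not from the original text: it drives the update of Eq. (3.145) with Gaussian gradient noise and counts sign flips of $b_w$. The setup (noise scale, step count, constants) is an assumption chosen purely for illustration:

```python
import numpy as np

# Illustrative simulation: with noisy gradients, the events
# dL/d(w_hat) <= -gamma and dL/d(w_hat) >= +gamma become rarer as gamma
# grows, so the binarized weight b_w = sign(w) flips less often.
def count_oscillations(gamma, steps=10000, eta=0.1, alpha=1.0, sigma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    w, flips, prev_sign = 0.01, 0, 1.0
    for _ in range(steps):
        g = rng.normal(0.0, sigma)  # noisy surrogate for dL/d(w_hat)
        # update rule of Eq. (3.145), STE mask assumed active
        w = (1 - eta * gamma) * w - eta * alpha * (g - gamma * np.sign(w))
        s = np.sign(w)
        if s != prev_sign:
            flips += 1
        prev_sign = s
    return flips

few = count_oscillations(gamma=1.0)   # resilient term active
many = count_oscillations(gamma=0.0)  # plain update, no additional term
```

With $\gamma = 0$ the update degenerates to a drift-free random walk whose sign flips freely, while a nonzero $\gamma$ both decays the latent weight toward a stable magnitude and raises the gradient threshold needed for a flip, matching the bound in Eq. (3.148).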